Determining the number of clusters using information entropy for mixed data

Authors

  • Jiye Liang
  • Xingwang Zhao
  • Deyu Li
  • Fuyuan Cao
  • Chuangyin Dang
Abstract

In cluster analysis, one of the most challenging problems is determining the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. Many algorithms have been proposed to solve this problem for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical and categorical attributes. To overcome this deficiency, this paper presents a generalized mechanism that integrates Rényi entropy and complement entropy. The mechanism uniformly characterizes within-cluster entropy and between-cluster entropy and identifies the worst cluster in a mixed data set. To evaluate the clustering results for mixed data, an effective cluster validity index is also defined. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real-world data sets. Comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results. © 2011 Elsevier Ltd. All rights reserved.
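
As a rough illustration of two ingredients the abstract names, the sketch below computes complement entropy for a single categorical attribute (the sum over categories of p(1 - p)) and a classic k-prototypes style dissimilarity (squared Euclidean distance on numerical attributes plus a weighted count of categorical mismatches). This is not the new dissimilarity measure or the validity index proposed in the paper, whose definitions are not given in the abstract; the function names and the weight `gamma` are assumptions made for this example.

```python
from collections import Counter


def complement_entropy(values):
    """Complement entropy of one categorical attribute: sum of p * (1 - p).

    `values` is a list of category labels (hypothetical input format).
    Returns 0.0 when all objects share one category.
    """
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return sum((c / n) * (1 - c / n) for c in counts.values())


def kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Classic k-prototypes dissimilarity (illustrative, not the paper's).

    Numerical part: squared Euclidean distance to the prototype.
    Categorical part: number of mismatched attributes, weighted by `gamma`.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


if __name__ == "__main__":
    # Two equally frequent categories give the maximum for two categories.
    print(complement_entropy(["a", "a", "b", "b"]))
    # One numeric attribute differs by 2, one categorical attribute mismatches.
    print(kprototypes_dissimilarity([1.0, 2.0], ["x", "y"],
                                    [1.0, 4.0], ["x", "z"], gamma=0.5))
```

A lower complement entropy within a cluster indicates that its objects agree more on that attribute; `gamma` trades off the influence of categorical attributes against numerical ones.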

Similar resources

Regional Evaluation of Hydrometric Monitoring Stations through Using Entropy Theory

Proper design and operation of monitoring systems for water resources management is one of the most important issues of water quality and quantity and accuracy and adequacy of data. The proper evaluation of these data has a determining role in the correct and consistent decisions in the area covered by the system. Therefore, determining proper distribution and number of monitoring network stations a...

Oil Reservoirs Classification Using Fuzzy Clustering (RESEARCH NOTE)

Enhanced Oil Recovery (EOR) is a well-known method to increase oil production from oil reservoirs. Applying EOR to a new reservoir is a costly and time-consuming process. Incorporating available knowledge of oil reservoirs in the EOR process eliminates these costs and saves operational time and work. This work presents a universal method to apply EOR to reservoirs based on the available data by...

Numerical Investigation of Nanofluid Mixed Convection and Entropy Generation in an Inclined Ventilating Cavity

This paper presents the results of a numerical study of mixed convection and entropy generation of Cu–water nanofluid in a square ventilating cavity at different inclination angles. Except for a section of the bottom wall with a uniform heat flux, all of the cavity walls are insulated. The inlet port is placed at the bottom of the left wall and the outlet port is positioned at the top of the right wall....

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...

Entropy Generation of Variable Viscosity and Thermal Radiation on Magneto Nanofluid Flow with Dusty Fluid

The present work illustrates the variable viscosity of a dusty nanofluid flowing over a permeable stretched sheet with thermal radiation. The problem has been modelled mathematically by introducing the mixed convective condition and magnetic effect. Additionally, analysis of entropy generation and Bejan number provides the fine points of the flow. The model equations are transformed into non-linear or...

Journal:
  • Pattern Recognition

Volume 45

Publication date: 2012